{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# How to define TabularCPD and LinearGaussianCPD\n", "\n", "One can implement `TabularCPD` for discrete variables inside `DiscreteBayesianNetwork`.\n", "and `LinearGaussianCPD` for continuous variables inside `LinearGaussianBayesianNetwork`. \n", "\n", "In this tutorial, we will demonstrate how to define each CPD.\n", "\n", "## TabularCPD for discrete variables\n", "\n", "In tabular CPD, the probability for discrete variable is given as a table. Let us start with examples for independent variables." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from pgmpy.factors.discrete import TabularCPD\n", "\n", "cpd_coin_fair = TabularCPD(variable=\"coin\", variable_card=2, values=[[0.5], [0.5]], state_names={'Coin': ['Head', 'Tail']})\n", "cpd_coin_biased = TabularCPD(variable=\"coin\", variable_card=2, values=[[0.9], [0.1]], state_names={'Coin': ['Head', 'Tail']})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a coin flip, we have discrete number (2) possible outcomes, therefore `variable_card=2` is passed. `values` pass probabilities for each outcome, which sum up to 1. `state_names` is optional, it gives names for each outcome." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "cpd_smoke = TabularCPD(variable=\"Smoker\", variable_card=2, values=[[0.3], [0.7]], state_names={'Smoker': ['Non-smoker', 'Smoker']})\n", "cpd_pollution = TabularCPD(variable=\"Pollution\", variable_card=3, values=[[0.7], [0.29], [0.01]], state_names={'Pollution': ['Clean', 'Bad', 'Fatal']})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Pollution` and `Smoker` variables do not depend on other variables. They take categorical values. `variable_card` denotes how many categories the variable can take. For example `Smoker` can take a binary value since `variable_card=2`. `values` is an array of probability values for each category. Note that the probability `values` sum up to 1.\n", "\n", "### Tabular CPD with multiple evidence.\n", "\n", "We next consider another discrete variable `Cancer`, with a model assumption that it depends on `Pollution` and `Smoker`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "cpd_cancer = TabularCPD(\n", " variable=\"Cancer\",\n", " variable_card=2,\n", " values=[[0.20, 0.15, 0.03, 0.05, 0.001, 0.02],\n", " [0.80, 0.85, 0.97, 0.95, 0.999, 0.98]],\n", " evidence=[\"Smoker\", \"Pollution\"],\n", " evidence_card=[2, 3],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For `Cancer` variable, we pass another `TabularCPD`. `evidence_card` denotes cardinality of each evidence variable. There are total `2*3=6` different combinations of evidence. Note the values of `evidence` affects probability of `Cancer`. We have 6 columns in `values` denoting conditional probabilities. `values` sums up to 1 columnwise. The columns are partitioned by the first evidence first. The first three columns correspond to non-smokers.\n", "\n", "We next consider another discrete variable `D`, with a model assumption that it depends on `A`, `B` and `C`." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "cpd = TabularCPD(\n", " variable=\"D\",\n", " variable_card=2,\n", " values=[[0.20, 0.15, 0.93, 0.05, 0.001, 0.02, 0.10, 0.25 ],\n", " [0.80, 0.85, 0.07, 0.95, 0.999, 0.98, 0.90, 0.75 ]],\n", " evidence=[\"A\", \"B\", \"C\"],\n", " evidence_card=[2, 2, 2],\n", ")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------+------+------+------+------+-------+------+------+------+\n", "| A | A(0) | A(0) | A(0) | A(0) | A(1) | A(1) | A(1) | A(1) |\n", "+------+------+------+------+------+-------+------+------+------+\n", "| B | B(0) | B(0) | B(1) | B(1) | B(0) | B(0) | B(1) | B(1) |\n", "+------+------+------+------+------+-------+------+------+------+\n", "| C | C(0) | C(1) | C(0) | C(1) | C(0) | C(1) | C(0) | C(1) |\n", "+------+------+------+------+------+-------+------+------+------+\n", "| D(0) | 0.2 | 0.15 | 0.93 | 0.05 | 0.001 | 0.02 | 0.1 | 0.25 |\n", "+------+------+------+------+------+-------+------+------+------+\n", "| D(1) | 0.8 | 0.85 | 0.07 | 0.95 | 0.999 | 0.98 | 0.9 | 0.75 |\n", "+------+------+------+------+------+-------+------+------+------+\n" ] } ], "source": [ "print(cpd)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each column of `values` correspond to a different combinations of `evidence` variables. If the evidence variables were lexicographic as above `(A, B, C)`, then the columns will be also lexicographic. \n", "\n", "```(A(0)B(0)C(0), A(0)B(0)C(1), A(0)B(1)C(0), A(0)B(1)C(1), ...)```\n", "\n", "It is as though we loop over A, and B, and C, in a nested order. \n", "\n", "### CPD with random values\n", "It is also possible to create a random tabular CPD." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.55203474],\n", " [0.44796526]])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cpd_coin_random = TabularCPD.get_random(variable=\"coin\", cardinality = {'coin':2}, state_names={'Coin': ['Head', 'Tail']})\n", "cpd_coin_random.get_values()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the probabilities can be retrieved by `get_values` and they add up to 1 again. If `cardinality` is missing, the discrete variable is assumed to be binary. \n", "\n", "When there are `evidence`, the cardinality of evidence needs to be also specified." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.30404579, 0.99344749, 0.47933983, 0.58564998, 0.54269003,\n", " 0.28164646],\n", " [0.69595421, 0.00655251, 0.52066017, 0.41435002, 0.45730997,\n", " 0.71835354]])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cpd_cancer_random = TabularCPD.get_random(\n", " variable=\"Cancer\",\n", " cardinality = {\"Cancer\": 2, \"Smoker\": 2, \"Pollution\": 3},\n", " evidence=[\"Smoker\", \"Pollution\"],\n", ")\n", "cpd_cancer_random.get_values()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1., 1., 1., 1., 1., 1.])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cpd_cancer_random.get_values().sum(axis = 0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note again the probability table sums up to 1 for each column.\n", "\n", "## LinearGaussianCPD for continuous variables\n", "\n", "We define CPDs for each variable. `LinearGaussianCPD` assumes that each variable takes normal distribution. \n", "Assume `Healthy` and `Wealthy` variables have no parents (empty evidence). " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "